Poodle: Pandas + Sklearn

Pandas is a wonderful framework for data management, and Sklearn is a powerful tool for machine learning. However, there has been no package that combines them. Poodle is a package that mixes Pandas and Sklearn to offer convenience and power at the same time in data science.

Previously, you had to read data from a file before performing machine learning, unless the data was generated online. In Poodle, you don't need to read your data in a separate step. The machine learning tools in Poodle read data from a file when it is needed. Moreover, your datasheet files are always synchronized with your machine learning operations. Therefore, you can keep monitoring your data while machine learning is running.

This is a basic tutorial illustrating how to use Poodle, although the tutorial is still under development. By reading this document, you can see what Poodle will do. As a data scientist, I always feel very tired when I perform machine learning, mainly for two reasons. First, as I mentioned above, there must be steps to load data into memory for machine learning. Second, after data is loaded into memory, it is really hard to monitor how the data is processed during machine learning. Therefore, you often save your data at each step by yourself, which makes the work tiresome. Poodle aims to remove the loading step and make the monitoring process fully automatic.

Below, the basic usage of Poodle is introduced with an example: linear regression on dummy data. To test this example yourself, you need two files: the Poodle code in poodle/linear_model.py and the dummy data in sheet/xy_pdl.csv.

Notice that although the input data file is a CSV file, the actual format is an extension of CSV. You should use ID, X, y as special keywords, where X and y represent a feature array and a target array as used in Sklearn, and ID represents the index column, which becomes the index of the DataFrame(). The feature names below X, such as x1, x2, x3, are not fixed, so you can use any words for them, which gives great flexibility in your processing. The same holds for y1, y2 below y: you can use 'left', 'right' instead of 'y1', 'y2' according to your project. Similarly, the index names below ID are flexible. Here sorted numeric values are used, but you can use any words for them, sorted or not.
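The layout described above can also be read with plain Pandas. The following is a minimal sketch, not Poodle's own reader: it assumes the file has two header rows (a keyword row with ID, X, y, then a name row) followed by the data rows. The exact on-disk layout of sheet/xy_pdl.csv is not shown in this tutorial, so the inline text below is only a guess at it.

```python
import io

import pandas as pd

# Assumed layout (not confirmed by Poodle): keyword row, name row, then data.
csv_text = """ID,X,X,X,y,y
id,x1,x2,x3,y1,y2
0,1,2,3,6,20
1,4,5,6,15,47
2,7,8,9,24,74
3,4,5,8,17,55
4,8,9,4,21,59
"""

# The two header rows become a two-level column index;
# the first column becomes the row index.
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)

X = df["X"]  # feature columns x1, x2, x3
y = df["y"]  # target columns y1, y2

print(list(X.columns))
print(X.shape, y.shape)
```

Because the names in the second header row are only labels, replacing 'y1', 'y2' with 'left', 'right' in the text would not change how X and y are selected.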


In [15]:
from importlib import reload

Start from linear regression

Let us start from linear regression, which is a simple but widely used machine learning method.

In Poodle, linear_model can be imported just as in Sklearn. In Sklearn, input and output data are variables, while Poodle supports a CSV file based on Pandas DataFrame().


In [16]:
from poodle import linear_model

As mentioned before, the Sklearn commands for LinearRegression can be used, except that the input data are no longer arrays. Instead, they are data in a CSV file. Hence, you can give a file name instead of X, y as arrays.

  • fit() is a modified function for special purposes such as loading input data from a file.

In [17]:
ml = linear_model.LinearRegression()
ml.fit('sheet/xy_pdl.csv')


Out[17]:
LinearRegression()
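For comparison, the steps that a file-based fit() saves can be written by hand with Pandas and Sklearn. This is a minimal sketch, not Poodle's implementation: it assumes a two-header-row CSV layout (keyword row, then name row) and uses inline text in place of the real file.

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression

# Inline stand-in for the datasheet; the two-header-row layout is an assumption.
csv_text = """ID,X,X,X,y,y
id,x1,x2,x3,y1,y2
0,1,2,3,6,20
1,4,5,6,15,47
2,7,8,9,24,74
3,4,5,8,17,55
4,8,9,4,21,59
"""

# Step 1: load the sheet into memory.
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)

# Step 2: split it into the feature array and the target array.
X = df["X"].to_numpy()
y = df["y"].to_numpy()

# Step 3: fit the model on the arrays.
model = LinearRegression()
model.fit(X, y)

# With two target columns, coef_ holds one row of coefficients per target.
print(model.coef_.shape)
```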

Now all other operations are the same as the commands in the original LinearRegression method. You can predict for new input data. For now, the additional input data is not a file; this will be updated to use a file later on. After that, you will be able to specify training data in fit() and testing data in predict().

  • predict() is the same function as in Sklearn. It is inherited from the parent class.

In [18]:
ml.predict( [[1,2,3]])


Out[18]:
array([[  6.,  20.]])
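The division of labor between a modified fit() and an inherited predict() can be sketched with a small subclass. This is only an illustration under stated assumptions: the class name CsvLinearRegression, the temporary file xy_demo.csv, and the two-header-row CSV layout are all hypothetical and are not taken from Poodle's source.

```python
import os
import tempfile

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical datasheet text; the two-header-row layout is an assumption.
csv_text = """ID,X,X,X,y,y
id,x1,x2,x3,y1,y2
0,1,2,3,6,20
1,4,5,6,15,47
2,7,8,9,24,74
3,4,5,8,17,55
4,8,9,4,21,59
"""


class CsvLinearRegression(LinearRegression):
    """Hypothetical wrapper: fit() also accepts a path to a datasheet."""

    def fit(self, X, y=None):
        # If X is a file path, load the sheet and split it into arrays first.
        if isinstance(X, (str, os.PathLike)):
            df = pd.read_csv(X, header=[0, 1], index_col=0)
            X, y = df["X"].to_numpy(), df["y"].to_numpy()
        # Delegate the actual fitting to Sklearn's LinearRegression.
        return super().fit(X, y)


# Write the sheet to a temporary file so that fit() receives a real path.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "xy_demo.csv")
    with open(path, "w") as f:
        f.write(csv_text)
    model = CsvLinearRegression().fit(path)

# predict() is not overridden, so it is the parent class method.
pred = model.predict([[1, 2, 3]])
print(pred)
```

Keeping the X, y signature lets the same class accept either a file name or ordinary arrays, which matches the idea of giving a file name instead of X, y as arrays.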

Format for datasheet

In Poodle, a datasheet must follow a certain format. Otherwise, the machine learning operation will not work.

  • To write data, you can refer to the example datasheet sheet/xy_pdl.csv.

In [21]:
linear_model.read_csv( 'sheet/xy_pdl.csv')


Out[21]:
     X         y
    x1 x2 x3  y1  y2
id
0    1  2  3   6  20
1    4  5  6  15  47
2    7  8  9  24  74
3    4  5  8  17  55
4    8  9  4  21  59

Later plan

Other functions in LinearRegression() and other tools in Sklearn will be added to Poodle step by step.

